docs(operations): add containerized GPU workloads guide by lexfrei · Pull Request #555 · cozystack/website

Aleksei Sviridkin (lexfrei) · 2026-05-28T17:57:34Z

What this PR does

Add a new operations guide describing the container variant of cozystack.gpu-operator — the architectural mode for containerized GPU workloads (CUDA pods, ML training, inference) on Linux GPU nodes that already ship the NVIDIA driver and nvidia-container-toolkit via the distro package manager.

The new page lands at content/en/docs/next/operations/gpu-container-workloads.md and rounds out the GPU documentation surface:

Running VMs with GPU Passthrough — VFIO passthrough of whole GPUs to KubeVirt VMs (default variant).
GPU Sharing with HAMi — fractional GPU sharing in tenant Kubernetes clusters.
Running Containerized GPU Workloads — this page. Containerized GPU workloads on management nodes (container variant).

Content covers when to pick the variant (host driver + host toolkit + a containerd-registered nvidia runtime prerequisite), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override on a stock apt install), the Talos caveat with a pointer to the examples/values-native-talos.yaml reference, install steps with Package CR variant: container, a sample CUDA pod for verification, why stacking HAMi directly on this variant is not supported yet, and a three-row variant comparison matrix.

Companion to cozystack/cozystack#2766, which adds the container variant itself.

Release note

docs(operations): add guide for containerized GPU workloads via the gpu-operator `container` variant.

Summary by CodeRabbit

Documentation
- New guide for running containerized GPU workloads on cluster nodes: prerequisites, installation via the Package CR, explicit warning against using bundles.enabledPackages for this variant, operator health and GPU allocatable verification, sample CUDA Pod workflow, fractional GPU sharing via HAMi, and a comparison of container, default (VM passthrough), and vGPU variants.

netlify · 2026-05-28T17:57:40Z

✅ Deploy Preview for cozystack ready!

Name	Link
🔨 Latest commit	`f2ae9b7`
🔍 Latest deploy log	https://app.netlify.com/projects/cozystack/deploys/6a26cec5f38a100008e4fbb0
😎 Deploy Preview	https://deploy-preview-555--cozystack.netlify.app

coderabbitai · 2026-05-28T17:57:44Z

Walkthrough

Adds a new operations guide documenting how to run containerized GPU workloads on Cozystack management nodes using the cozystack.gpu-operator container variant, including prerequisites, Package CR installation, health checks, CUDA smoke-test, HAMi fractional-sharing notes, and a variant comparison table.

Changes

GPU Container Workloads Documentation

Layer / File(s)	Summary
GPU container variant guide `content/en/docs/next/operations/gpu-container-workloads.md`	New operations guide explains when to use the container variant (host has NVIDIA driver and `nvidia-container-toolkit`), installation prerequisites, Package CR setup with warnings against `bundles.enabledPackages`, operator health verification, `nvidia.com/gpu` allocatable checks, a CUDA smoke-test Pod example, HAMi fractional-sharing guidance, and a variant comparison table.

Possibly related issues

cozystack/cozystack#2764: Directly addresses the same cozystack.gpu-operator container variant documentation and configuration guidance referenced in this PR.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'docs(operations): add containerized GPU workloads guide' directly and clearly summarizes the main change: adding a new documentation page for containerized GPU workloads, which matches the added content perfectly.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

gemini-code-assist

Code Review

This pull request adds a new documentation page detailing how to run containerized GPU workloads using the container variant of the cozystack.gpu-operator package. The review feedback suggests specifying the cozy-system namespace in both the kubectl patch command and the Package resource manifest to ensure they are applied to the correct namespace.

gemini-code-assist · 2026-05-28T17:58:19Z

+kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
+  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'


In Cozystack, the Package resources (including cozystack.cozystack-platform) are typically located in the cozy-system namespace. Running kubectl patch without specifying the namespace will fail if the user's current context is set to another namespace (like default). Adding -n cozy-system ensures the command runs successfully.

Suggested change

-kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
-  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
+kubectl patch packages.cozystack.io cozystack.cozystack-platform -n cozy-system --type=json \
+  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

gemini-code-assist · 2026-05-28T17:58:20Z

+apiVersion: cozystack.io/v1alpha1
+kind: Package
+metadata:
+  name: cozystack.gpu-operator
+spec:
+  variant: container


The Package resource needs to be created in the cozy-system namespace for the Cozystack operator to detect and reconcile it. Adding namespace: cozy-system to the metadata ensures it is applied to the correct namespace.

Suggested change

 apiVersion: cozystack.io/v1alpha1
 kind: Package
 metadata:
   name: cozystack.gpu-operator
+  namespace: cozy-system
 spec:
   variant: container

coderabbitai · 2026-05-28T18:42:37Z

Actionable comments posted: 0

myasnikovdaniil

Thanks — this is a well-researched page and most of it checks out against the companion PR cozystack/cozystack#2766 and the platform chart. A few substantive items before merge.

Main blocker: the Fractional GPU sharing section directs users into a device-plugin registration conflict (see inline comment). HAMi does not reuse the operator's device plugin — it ships its own, and the auto-disable that prevents the clash only exists in the tenant kubernetes app chart, not on the management cluster. The container variant pins devicePlugin.enabled: true, so stacking cozystack.hami on top as written runs two plugins both registering nvidia.com/gpu.

Sequencing: cozystack/cozystack#2766 (which adds the container variant) is still open. This page documents a variant that doesn't exist yet — please hold merge until #2766 lands, or confirm both ship in the same release train.

Smaller accuracy/UX fixes inline. Recommendation: request changes.

myasnikovdaniil · 2026-06-08T09:37:55Z

+
+## Fractional GPU sharing
+
+The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. To slice one GPU across multiple pods (memory and compute quotas per pod), enable HAMi on top — HAMi reuses the same device plugin layer and is wired in via the `cozystack.hami` package, which already depends on `cozystack.gpu-operator`. See [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) for the tenant Kubernetes flow; for management-cluster workloads the wiring is the same package set with HAMi enabled.


⚠️ This HAMi claim is incorrect and would lead users into a resource conflict.

"HAMi reuses the same device plugin layer" is wrong. HAMi ships its own device plugin + scheduler extender. The page you link to states the opposite: "When HAMi is enabled, GPU Operator's built-in device plugin is automatically disabled to avoid resource registration conflicts."

That auto-disable only lives in the tenant kubernetes app chart (packages/apps/kubernetes/tests/gpu_operator_hami_test.yaml — "should disable devicePlugin when hami is enabled"). The management-cluster cozystack.hami PackageSource only declares dependsOn: cozystack.gpu-operator (install ordering); packages/system/hami/values.yaml does not touch the operator's device plugin.

The container variant pins devicePlugin.enabled: true (values-container.yaml in #2766). Stacking cozystack.hami on top, as written, runs two device plugins both registering nvidia.com/gpu — exactly the conflict the HAMi doc warns about.

Suggested rewrite:

The `container` variant exposes whole GPUs through the upstream NVIDIA device plugin. For fractional sharing (per-pod memory and compute quotas), see [GPU Sharing with HAMi](/docs/next/kubernetes/gpu-sharing/) — currently documented for tenant Kubernetes clusters, where enabling HAMi automatically disables the GPU Operator's built-in device plugin to avoid resource-registration conflicts. Stacking the `cozystack.hami` package directly on top of the `container` variant on the management cluster is not a supported combination yet: the variant pins the NVIDIA device plugin on, and running it alongside HAMi's device plugin causes both to register `nvidia.com/gpu`.

The intro at line 10 ("you can stack HAMi on top once the container variant is up") echoes the same claim and should be softened to match.

myasnikovdaniil · 2026-06-08T09:37:55Z

+## Prerequisites
+
+- A Cozystack management cluster with at least one GPU-enabled node.
+- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.


The companion PR's own OS-support table (docs/gpu-vgpu.md in #2766) only covers Ubuntu 20.04–26.04 and Talos. Cozystack's documented node-OS surface is Talos + Ubuntu/Debian (ansible path). Listing RHEL/Fedora/openSUSE as "supported" presents untested territory as fact.

- The GPU node runs Ubuntu or Debian with the NVIDIA driver installed via the distro package manager (other distros with an equivalent driver + toolkit package layout should work the same way but are not regularly tested). Verify with `nvidia-smi` …

myasnikovdaniil · 2026-06-08T09:37:55Z

+
+- A Cozystack management cluster with at least one GPU-enabled node.
+- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.
+- `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).


apt install nvidia-container-toolkit alone does not modify containerd config — registration is a separate manual step. A reader on a fresh node will fail this grep with no pointer to the fix. Suggest spelling out the registration:

- `nvidia-container-toolkit` installed on the same node and registered with containerd:

  ```bash
  sudo nvidia-ctk runtime configure --runtime=containerd
  sudo systemctl restart containerd
  grep nvidia /etc/containerd/config.toml   # must show the runtime entry
  ```
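A bare `grep nvidia` also matches comments or unrelated keys, so a slightly stricter check may be worth considering. The sketch below is not from the PR: `has_nvidia_runtime` is a hypothetical helper name, and it assumes the registration shows up as a TOML table header ending in `containerd.runtimes.nvidia]` inside the file being checked. If the toolkit writes its entry somewhere else on a given node, pass that path as the first argument.

```bash
#!/bin/sh
# Hypothetical helper, not part of the PR or of nvidia-container-toolkit.
# Succeeds when the given containerd config declares a runtime table named
# "nvidia", which is stricter than grepping for the bare word.
has_nvidia_runtime() {
    config_path="${1:-/etc/containerd/config.toml}"
    # Match a TOML table header ending in "containerd.runtimes.nvidia]".
    # "[]]" is a bracket expression that contains only "]".
    grep -q 'containerd\.runtimes\.nvidia[]]' "$config_path" 2>/dev/null
}

if has_nvidia_runtime; then
    echo "nvidia runtime is registered"
else
    echo "nvidia runtime not found; run: sudo nvidia-ctk runtime configure --runtime=containerd"
fi
```

A missing or unreadable config file is treated the same as a missing runtime entry, which keeps the check usable as a single pre-flight condition.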

myasnikovdaniil · 2026-06-08T09:37:55Z

+
+```bash
+kubectl apply -f cuda-smoke.yaml
+kubectl logs cuda-smoke


Run back-to-back, kubectl logs errors while the (large) CUDA base image is still pulling. Add a wait:

kubectl apply -f cuda-smoke.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke --timeout=5m
kubectl logs cuda-smoke

myasnikovdaniil · 2026-06-08T09:37:55Z

+- A Cozystack management cluster with at least one GPU-enabled node.
+- The GPU node runs a supported Linux distribution (Ubuntu, Debian, RHEL, Fedora, openSUSE) with the NVIDIA driver installed via the distro package manager. Verify with `nvidia-smi` over SSH or `kubectl debug node/<node-name>` — it must enumerate the physical GPUs and report a working driver version.
+- `nvidia-container-toolkit` installed on the same node and registered with containerd (`grep nvidia /etc/containerd/config.toml` shows the runtime entry).
+- `kubectl` configured against the management cluster.


Minor gotcha worth one prerequisite line: the container variant relies on the upstream default workload container for unlabeled nodes. A node still carrying nvidia.com/gpu.workload.config=vm-passthrough from the GPU Passthrough guide overrides that per-node and the device plugin won't serve it — a likely trip-up when migrating a node off the passthrough setup.

- The GPU node must not carry a `nvidia.com/gpu.workload.config` label left over from the passthrough setup (`kubectl label node <node-name> nvidia.com/gpu.workload.config-` to remove).
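To make the leftover-label check scriptable before removing anything, something like the following could work. It is a sketch only: `has_workload_config_label` is a hypothetical helper, and it assumes its input is the output of `kubectl get node <node-name> --show-labels`, where labels appear as comma-separated `key=value` pairs.

```bash
#!/bin/sh
# Hypothetical helper, not part of the PR. Reads the output of
# `kubectl get node <node-name> --show-labels` on stdin and succeeds
# when a nvidia.com/gpu.workload.config label is still present.
has_workload_config_label() {
    grep -q 'nvidia\.com/gpu\.workload\.config='
}

# Usage against a live cluster (commented out so the sketch runs anywhere):
# kubectl get node <node-name> --show-labels | has_workload_config_label \
#     && echo "leftover workload.config label found; remove it first"
```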

Document the new container variant of cozystack.gpu-operator, paired with cozystack/cozystack#2766.

Covers the apt-installed-driver-and-toolkit Linux shape that the variant targets: when to pick it over the passthrough and vGPU variants, prerequisites (host driver + host nvidia-container-toolkit registered with containerd via nvidia-ctk runtime configure, validated with nvidia-smi over kubectl debug), the host-driver reuse path (driver.enabled=false, so the operator uses the pre-installed driver at its standard location with no driverInstallDir override needed on a stock apt install), the Talos caveat with a pointer to the values-native-talos.yaml reference, install steps, a sample CUDA pod for verification, the variant comparison matrix, and a note on why stacking HAMi directly on the container variant on the management cluster is not a supported combination yet (both register nvidia.com/gpu).

Lands under operations/, symmetric with virtualization/gpu.md (VM passthrough on management cluster) and kubernetes/gpu-sharing.md (HAMi in tenant Kubernetes addons).

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

Aleksei Sviridkin (lexfrei) · 2026-06-08T14:24:11Z

Thanks — addressed in the latest push.

HAMi (the blocker) — rewritten. You're right: HAMi ships its own device plugin, the operator-device-plugin auto-disable lives only in the tenant kubernetes app chart, and sources/hami.yaml only declares dependsOn for ordering. The page now says stacking cozystack.hami directly on the container variant on the management cluster is not supported yet (both register nvidia.com/gpu), and the intro line is softened to match.

OS support — narrowed to Ubuntu/Debian as tested; RHEL/Fedora/openSUSE are no longer presented as supported, just "should work but not regularly tested."

containerd registration — spelled out with the explicit nvidia-ctk runtime configure --runtime=containerd + restart + grep block.

Leftover nvidia.com/gpu.workload.config label — added as a prerequisite with the removal command.

CUDA smoke pod — added kubectl wait --for=jsonpath='{.status.phase}'=Succeeded before kubectl logs.

Validator path — same reframe as the code PR: dropped /host/usr/bin/nvidia-smi, now "host driver at its standard location, no driverInstallDir override on apt."

On the bot's namespace suggestions (-n cozy-system / namespace: cozy-system on the Package CR): left out deliberately — Cozystack's own canonical examples (packages/core/installer/example/platform.yaml, examples/values-native-talos.yaml) create Package CRs with no namespace, so adding one would diverge from the shipped convention. The current doc uses kubectl apply -f, not kubectl patch, so that suggestion doesn't apply either.

Sequencing: agreed — this should land with / after cozystack/cozystack#2766. The page is in the next/ tree so it tracks the unreleased variant.

myasnikovdaniil

NOT LGTM — the practical advice in the bundles.enabledPackages warning is right, but its stated failure mechanism is factually wrong and will mislead operators.

Business context: documents the container variant of cozystack.gpu-operator for running CUDA pods on management-cluster nodes that already ship the NVIDIA driver + container toolkit from the distro package manager.

Status of the requested changes (2026-06-08 review)

✅ HAMi device-plugin conflict — the Fractional GPU sharing section now explains cozystack.hami and the container variant both register nvidia.com/gpu and aren't a supported combination.
✅ OS support scope — Ubuntu/Debian primary, other distros "not regularly tested."
✅ containerd nvidia runtime registration — nvidia-ctk runtime configure + restart + verify present.
✅ leftover nvidia.com/gpu.workload.config label — prerequisite bullet with removal command added.
✅ CUDA smoke-pod — kubectl wait …Succeeded added before kubectl logs.
✅ host-driver / driver.enabled=false path — reframed clearly; Talos caveat points at the reference values file.

Outstanding

B1 (blocker) — bundles.enabledPackages warning states the wrong failure mechanism — inline at line 41. The text says the bundle "hardcodes spec.variant: default" and "any user Package CR with variant: container is overwritten on the next reconcile." Neither is what happens: iaas.yaml renders the GPU operator via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and fails the Helm render if that value isn't default/vgpu. So container via the bundle path is a hard render error, not a silent overwrite. Keep the conclusion (use a standalone Package CR); fix the reason. Suggested wording inline.

Non-blocking:

#2766 passed helm template + unit tests but no hardware CUDA run — a "provisional pending hardware validation" note would help calibrate trust.
Prerequisite ordering: the nvidia.com/gpu.workload.config removal bullet sits after the containerd-registration block; a node migrating from the passthrough guide would remove the label before/with toolkit registration.

Analysis — where the issues come from

Original code: B1 (wrong bundle-mechanism text) and the ordering nit are both in the initial commit f2ae9b7.
Introduced by post-review fixes: none — the branch is a single commit; no regressions added.
Unresolved from the previous review: none — all six asks addressed.

myasnikovdaniil · 2026-06-10T11:01:59Z

+
+## 1. Install the GPU Operator (container variant)
+
+**Do not** add `cozystack.gpu-operator` to `bundles.enabledPackages` for this variant. The platform Helm chart's optional-package template hardcodes `spec.variant: default` for every name in `enabledPackages` and reconciles the resulting `Package` CR under Helm ownership — any user `Package` CR with `variant: container` is overwritten on the next reconcile. Apply the `Package` CR directly instead; the cozystack platform controller installs it without the bundle entry.


The stated reason here is incorrect, though the practical advice is right. gpu-operator in the iaas bundle does not go through the cozystack.platform.package.optional.default helper and does not hardcode spec.variant: default. iaas.yaml renders it via cozystack.platform.package with $gpuVariant = bundles.iaas.gpuOperatorVariant | default "default", and immediately fails the Helm render if that value is anything other than "default" or "vgpu":

{{- if not (or (eq $gpuVariant "default") (eq $gpuVariant "vgpu")) -}}
{{- fail (printf "bundles.iaas.gpuOperatorVariant must be \"default\" or \"vgpu\", got %q" $gpuVariant) -}}
{{- end -}}

So "container" via the bundle path causes a hard Helm render failure, not a silent overwrite — the user Package CR is never touched because the chart never renders. Suggested replacement:

Do not add cozystack.gpu-operator to bundles.enabledPackages for this variant. The iaas bundle template only accepts bundles.iaas.gpuOperatorVariant: default or vgpu; any other value — including container — causes a hard Helm render failure (packages/core/platform/templates/bundles/iaas.yaml). Apply the Package CR directly instead; the platform controller installs it without a bundle entry and without the variant restriction.
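For readers less used to Helm template syntax, the guard described in this comment can be restated as a plain shell function. This is an illustration of the logic only, under the assumption that the quoted template is the whole check; `check_bundle_variant` is a hypothetical name and nothing in Cozystack calls it.

```bash
#!/bin/sh
# Illustrative mirror of the iaas bundle guard quoted in the review comment.
# The real check is the Helm template; check_bundle_variant is hypothetical.
check_bundle_variant() {
    # An unset or empty value falls back to "default",
    # like `| default "default"` in the template.
    variant="${1:-default}"
    case "$variant" in
        default|vgpu)
            return 0
            ;;
        *)
            # Corresponds to the template's `fail`: the render stops here.
            printf 'bundles.iaas.gpuOperatorVariant must be "default" or "vgpu", got "%s"\n' "$variant" >&2
            return 1
            ;;
    esac
}

check_bundle_variant container || echo "render would fail: use a standalone Package CR"
```

The point the sketch makes is the one from the review: `container` is rejected before anything is rendered, so no existing `Package` CR is modified.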

gemini-code-assist Bot reviewed May 28, 2026

Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from 3170d45 to 8b83e54 Compare May 28, 2026 18:25

Aleksei Sviridkin (lexfrei) marked this pull request as ready for review May 28, 2026 18:36

Aleksei Sviridkin (lexfrei) requested review from Andrei Kvapil (kvaps) and Timofei Larkin (lllamnyp) as code owners May 28, 2026 18:36

Aleksei Sviridkin (lexfrei) self-assigned this May 28, 2026

Aleksei Sviridkin (lexfrei) mentioned this pull request Jun 2, 2026

Document out-of-the-box GPU passthrough for tenant Kubernetes clusters (gpu=on auto-label + NvLinkDisable default) #561

Open

Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from 8b83e54 to b9cae43 Compare June 5, 2026 10:03

myasnikovdaniil requested changes Jun 8, 2026

Aleksei Sviridkin (lexfrei) force-pushed the feat/gpu-container-workloads-docs branch from b9cae43 to f2ae9b7 Compare June 8, 2026 14:16

Aleksei Sviridkin (lexfrei) requested a review from myasnikovdaniil June 8, 2026 14:26

myasnikovdaniil requested changes Jun 10, 2026
